Brain Stroke Dataset - Analysis, Part III

Author: Jakub Bednarz

Previous parts: Part I, Part II.

Introduction

Previously, we attempted to explain the decisions made by our predictive models using SHAP values. In this report, we will approach the same task with a different method - Local Interpretable Model-agnostic Explanations (LIME). To be specific, we will:

  1. Understand how LIME operates;
  2. See it "in action" by explaining selected predictions using it;
  3. Compare its explanations with those obtained from SHAP;
  4. Check how the explanations differ across different classes of models.

LIME explained

LIME aims to explain individual predictions by training local surrogate models - simpler, interpretable models fitted to approximate the behavior of the black-box model in the vicinity of a given observation. To give an analogy, this is similar to how one can approximate a manifold at a point with its tangent space. Specifically, LIME:

  • first, transforms the data instances into an interpretable feature space;
  • approximates the black-box model at a point/observation with a local surrogate model (usually a linear model) - the fitting of this surrogate model is done on artificial points obtained by perturbing the original observation.

Mathematically, we may phrase this as $$\hat{g} = \text{arg min}_{g \in G} L(f, g, \pi(x)) + \Omega(g)$$ where $G$ denotes the class of surrogate models (for example, decision trees or linear regression models), $L$ is a loss function which measures the discrepancy between the black-box $f$ and the surrogate $g$ on samples weighted by the proximity measure $\pi(x)$, and $\Omega(g)$ denotes the penalty for the complexity of $g$ (for example, decision tree depth, or the number of non-zero coefficients of a linear model).
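The objective above can be illustrated with a minimal from-scratch sketch (all names here are my own, not the lime package's API): perturb the observation with Gaussian noise, weight the samples with an exponential kernel playing the role of $\pi(x)$, and fit a weighted ridge regression as the surrogate $g$, with the ridge penalty standing in for $\Omega(g)$.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(f, x, n_samples=5000, kernel_width=0.75, alpha=1.0, seed=0):
    """Fit a local linear surrogate to a black-box function f at point x.

    f maps an (n, d) array to an (n,) array of predicted probabilities;
    returns the surrogate's coefficients, one weight per feature.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # sample artificial points by perturbing the original observation
    Z = x + rng.normal(scale=1.0, size=(n_samples, d))
    # proximity weights pi(x): samples closer to x count more in the loss L
    dist = np.linalg.norm(Z - x, axis=1)
    pi = np.exp(-(dist ** 2) / kernel_width ** 2)
    # the ridge penalty alpha plays the role of the complexity term Omega(g)
    g = Ridge(alpha=alpha)
    g.fit(Z, f(Z), sample_weight=pi)
    return g.coef_

# toy black box: a smooth nonlinear function of two features
f = lambda Z: 1 / (1 + np.exp(-(Z[:, 0] ** 2 - Z[:, 1])))
coefs = lime_sketch(f, np.array([1.0, 0.0]))
```

Near the point $(1, 0)$ the toy function is increasing in the first feature and decreasing in the second, and the surrogate's coefficients reflect exactly that local behavior.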

LIME explanations for the brain stroke dataset

Let's now compute some explanations. We shall use the lime Python package.

To get a feel for what kind of explanations LIME provides, let's start with a single observation.

                   4379           2473
gender             Male           Male
age                31.0           66.0
hypertension       No             Yes
heart_disease      No             No
ever_married       No             Yes
work_type          Self-employed  Private
Residence_type     Rural          Urban
avg_glucose_level  64.85          82.91
bmi                23.0           28.9
smoking_status     Unknown        formerly smoked
stroke             0              0

Before we continue, we must remark on a certain aspect of LIME - it is important to choose an appropriate interpretable feature space for the data points. For example, when dealing with images, a feature importance for the value of a single pixel would be unhelpful - we are far more interested in higher-level features, such as an object of a given class being present in the image, or having some property (say, being red, or being oriented in a particular fashion). Because we are dealing with tabular data, however, and taking into account what the columns are, it seems unnecessary to further refine the features - they are already sufficiently high-level and interpretable.

With that clarified, let us look at the output of the lime package:

What we are looking at are:

  • on the left, the output of the tree model, i.e. predicted probabilities for not having (0) or having (1) a brain stroke;
  • in the middle, the contributions from each variable (note that lime discretizes continuous variables by default) - more specifically, the coefficients of the most important features in the local linear model trained on points in the vicinity of the selected observation, with the predicted probability as the target;
  • on the right, the original values of the variables (that is, before discretization.)

Let us now comment on the results themselves, and how to interpret them. The predicted probability of having a stroke is low (12%), and according to the local model the most significant factor keeping it low is the value of age being between 26 and 45 - specifically, it reduces the probability from the baseline by 22 percentage points; the next one, the subject never having been married, reduces it by 8 percentage points; their lack of hypertension reduces it by a further 6 percentage points, and so on.

Let's look at a different example:

Here, the situation is diametrically different - the predicted probability is 60%, with age being over 61 contributing 40 percentage points to the local surrogate's prediction, being ever married adding 7 percentage points, etc.

Study of randomness of LIME

One interesting thing about LIME is that, because it samples points in the vicinity of the observation, its explanations inherently depend on the random seed. Thus, a question arises - to what extent do the explanations change as the seed varies? Let's investigate:

  • for seed 0:
  • for seed 1:
  • for seed 99:

We can see that the explanations differ ever so slightly - to point out some of the differences, sometimes the ranks of hypertension=No and ever_married=No are swapped, likewise with glucose levels and BMI values. Whether that's an issue depends on the application, I would say.

Comparison with SHAP

Having now two methods in our arsenal, let's see whether (and if so, how) the explanations they offer differ.

We will perform the investigation on the following data point:

                   1071
gender             Female
age                49.0
hypertension       No
heart_disease      No
ever_married       Yes
work_type          Private
Residence_type     Urban
avg_glucose_level  67.55
bmi                17.6
smoking_status     formerly smoked
stroke             0

For this observation, lime yields the following explanation:

whereas shap outputs the following:

We can see that the results are fairly similar, but not identical - for example, SHAP attributes less importance to the lack of hypertension than LIME does. Let's look at another example, just to check whether this pattern holds:

                   1811
gender             Female
age                5.0
hypertension       No
heart_disease      No
ever_married       No
work_type          children
Residence_type     Rural
avg_glucose_level  109.4
bmi                20.0
smoking_status     Unknown
stroke             0

It would seem (from the admittedly atrociously small sample of two) that there exist some definite differences between the explanations offered by SHAP and LIME.

LIME on logistic regression

As we did with SHAP, we will now look at what the explanations look like for the logistic regression model, and how they differ from those for the XGBoost model. This case is particularly interesting, since a local linear approximation of a linear model is, well, the same model - with the caveat that logistic regression is linear in the log-odds, not in the probabilities.

                   4598      2204     2843          1277
gender             Male      Female   Female        Male
age                54.0      29.0     24.0          12.0
hypertension       No        No       No            No
heart_disease      No        No       No            No
ever_married       Yes       No       No            No
work_type          Govt_job  Private  Private       children
Residence_type     Urban     Rural    Urban         Rural
avg_glucose_level  72.96     86.55    149.17        81.74
bmi                37.7      29.8     23.1          28.3
smoking_status     smokes    smokes   never smoked  Unknown
stroke             0         0        0             0

In my estimation, "for the most part" the features indicated to be most significant are the same as for the XGBoost model. One difference I've noticed is that the logistic regression seems to put far more stock in the work type variable - in contrast, for the tree model, its contribution was virtually always negligible.

Now, let's test the aforementioned "local approximation of a linear model is the same linear model" hypothesis. One way to check it is to see whether, when explaining the logits of the probabilities returned by the logistic regression model, we get the same attributions (read: coefficients) for different observations. Doing just that for the four previous data points, we get:

The contributions for the same (possibly discretized) features are not the same. One can then wonder what the reason behind this could be - I surmise that the randomly sampled points used to fit each surrogate, the default discretization of continuous features (which makes the surrogate piecewise-constant rather than linear in the originals), and the complexity penalty could all play a role in this discrepancy, though I cannot say for sure.
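As a sanity check on the hypothesis itself, we can strip away the suspected confounders: with no discretization and no penalty, an ordinary least-squares fit on points sampled around any observation recovers the logistic regression's own coefficients exactly, since the log-odds are an exactly linear function of the features. A self-contained sketch (names are mine, not from the lime package):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=300) > 0).astype(int)
logreg = LogisticRegression().fit(X, y)

# log-odds of the fitted model: exactly linear in the features
logit = lambda Z: np.log(logreg.predict_proba(Z)[:, 1]
                         / logreg.predict_proba(Z)[:, 0])

# "LIME without discretization or penalty": perturb around a point
# and fit a plain, unpenalized linear model on the logits
x = X[0]
Z = x + rng.normal(scale=0.5, size=(1000, 3))
local = LinearRegression().fit(Z, logit(Z))

print(local.coef_)      # matches logreg.coef_[0] up to floating-point error
print(logreg.coef_[0])
```

This suggests the discrepancy observed above really does come from the sampling, discretization, and penalty, rather than from the "local linear approximation" idea being wrong.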